Los siguientes métodos me proporcione un a manera de comparar modelos o mirar su acertividad individual. Los dos primeros métodos son usados a la hora de clasificar y el último a la hora de regresar.

Matriz de confusión

Con el propósito de medir la calidad del modelo, después de realizar la validación cruzada podemos construir una matriz que nos permite visualizar de forma sencilla que tan bien clasificó nuestro modelo.

library(kableExtra)

## Warning: package 'kableExtra' was built under R version 3.5.2

base <- read.csv("../Bases de datos/college-perf.csv")
kable(head(base),"markdown")

SAT	GPA	Projects	Community	Income	Perf	Pred
1380	2.53	1	0	41800	Low	Low
1100	3.18	1	5	37600	Low	Low
1110	2.73	2	10	34800	Medium	Medium
1180	2.49	3	0	24100	Low	High
1240	2.89	3	5	56000	Medium	Medium
1140	2.85	2	0	50800	Low	Low

tabla <- table(base$Perf, base$Pred, dnn = c("Actual", "predicho"))
kable(tabla, "markdown")

	High	Low	Medium
High	458	35	38
Low	98	1150	84
Medium	170	166	1801

Nota:La diagonal principal me muestra las casos donde se acertó y los valores que estén fuera de esta muestra la información que está herrada.

Si queremos tener la información aterior representado como una proposión.

kable(prop.table(tabla),"markdown")

	High	Low	Medium
High	0.1145	0.00875	0.00950
Low	0.0245	0.28750	0.02100
Medium	0.0425	0.04150	0.45025

barplot(tabla, col = c(4,5,6), legend = T, main = "Matriz de confusión de forma gráfica")

Curva ROC

La curva ROC representa la capacidad de un clasificador binario para distinguir entre categorias. Con una curva ROC se busca la viabilidad en nuestra clasificación, en otras palabras el modelo que ofrezca una mayor área bajo la curva presenta una mejor tasa de predicción. Además me proporciona una forma de encontrar un punto de corte para una clasificación.

library(ROCR)

## Warning: package 'ROCR' was built under R version 3.5.2

## Warning: package 'gplots' was built under R version 3.5.2

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.5.2

base1 <- read.csv("../Bases de datos/roc-example-1.csv")
kable(head(base1),"markdown")

prob	class
0.9917340	1
0.9768288	1
0.9763148	1
0.9601505	1
0.9351574	1
0.9335989	1

pred1 <- prediction(base1$prob, base1$class)
perf1 <- performance(pred1, "tpr", "fpr")
plot(perf1)
lines(par()$usr[1:2], par()$usr[3:4])

El siguiente gráfico presenta de forma intuitiva la utilidad de la curva ROC y la importancia a la hora de encontrar un punto de corte de categorías óptimo.

r1 <- base1[base1$class ==1,]
r1 <- r1[1:46,]
r2 <- base1[base1$class ==0,]
ggplot(r1, aes(r1$prob))+geom_density(aes(fill ="red" ),alpha = 0.5) + geom_density(aes(r2$prob, fill ="purple" , alpha = 0.5), show.legend = F)+ labs(x = "Punto de corte") + scale_fill_discrete(name = "Categoria",labels = c("Enfermos","No enfermos","corte"))+ geom_vline(aes(xintercept = 0.5, color = "red"))+geom_vline(aes(xintercept = 0.8, color = "red"))+ geom_vline(aes(xintercept = 0, color = "red"))+ geom_vline(aes(xintercept = 1, color = "red")) + theme_bw(base_family = "Courier")

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x,
## x$y, : font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

prob.corte <- data.frame(cut = perf1@alpha.values[[1]],
                          fpr = perf1@x.values[[1]],
                          tpr = perf1@y.values[[1]])

Observamos algunos puntos de corte y sus tasas de verdaderos positivos y falsos positivos.

kable(head(prob.corte),"markdown")

cut	fpr	tpr
Inf	0	0.0000000
0.9917340	0	0.0185185
0.9768288	0	0.0370370
0.9763148	0	0.0555556
0.9601505	0	0.0740741
0.9351574	0	0.0925926

Elegimos una tasa de verdaderos positivos mayor a 0.8

kable(head(prob.corte[prob.corte$tpr>=0.8,]),"markdown")

	cut	fpr	tpr
55	0.4981506	0.2173913	0.8148148
56	0.4961696	0.2173913	0.8333333
57	0.4784074	0.2391304	0.8333333
58	0.4775468	0.2608696	0.8333333
59	0.4632342	0.2826087	0.8333333
60	0.4522735	0.2826087	0.8518519

Error cuadrático medio

Este método es útil para comparar modelos de regresión. Se busca que este error sea lo más pequeño posible.

La formula para este error esta dada por :

\[ECM = \frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2\]

mean((y_real-y_predicho)^2)

Curva ROC, matriz de confusión y Error cuadrático medio.

Yubar Daniel Marín Benjumea

https://unalyticsteam.github.io/unalytics.github.io/

Matriz de confusión

Curva ROC

Error cuadrático medio